Dataset in use: TMDb movie data
I will try to answer the following questions
- Which genres are most popular from year to year?
- What kinds of properties are associated with movies that have high revenues?
# Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
% matplotlib inline
# Read the data file
df = pd.read_csv('tmdb-movies.csv')
df.head(2)
Observation: We need to seperate genres data to make sure we can provide accurate statistics
# Display data statistics
df.describe()
# Checking the data types
df.dtypes
df.info();
# Check duplicate count
sum(df.duplicated())
# Check uniqueness
df.nunique()
# List the release years covered in the dataset
df['release_year'].unique()
Check for popularity outlier since we will use it as our dependent variable during the analysis. It looks like we have three points on the right that need to be removed.
# Check outliers in a box view
sns.boxplot(x=df['popularity']);
# Drop columns that contains many null values and not required for analysis
df.drop(['homepage', 'tagline', 'keywords', 'production_companies'], axis=1, inplace=True)
# deduplicate date
df.drop_duplicates(inplace=True)
sum(df.duplicated())
# Remove null cells
df.dropna(inplace=True)
df.shape
Removing popularity outlier
df = df[df.popularity < 20]
df.shape
df_genres = df.copy()
# Check if there is a movie that has a single genre
df_genres[~df_genres['genres'].str.contains('|')]
# Get a list of all unique genres in the dataset
hb_genres_series = df_genres['genres'].str.split('|').apply(pd.Series).stack().reset_index(drop=True).unique()
hb_genres_series.size
Count the number of movies per genres
# Create a dataframe to use genre as column index and release date in descendig order as column names
genre_year = pd.DataFrame(data=None,index=hb_genres_series,columns=df_genres['release_year'].sort_values(ascending=False).unique())
genre_year.fillna(value=0, inplace=True)
genre_year_popularity = genre_year.copy()
# Fill genre_year with the movies count for each genre
# row is the dataframe row
def count_genres(row):
for genre in row['genres'].split('|'):
genre_year.loc[genre, row['release_year']] += 1
df_genres.apply(count_genres, axis = 1)
genre_year.head()
# Adding Max column to hold the year of maximum genre
genre_year['Max'] = genre_year.idxmax(axis=1)
genre_year['Max']
Sum up the popularity for genre
# Fill genre_year with the movies sum of popularities for each genre
# row is the dataframe row
def genres_popularity(row):
for genre in row['genres'].split('|'):
genre_year_popularity.loc[genre, row['release_year']] += row['popularity']
df_genres.apply(genres_popularity, axis = 1)
genre_year_popularity.head()
# Adding Max column to hold the year of maximum genre
genre_year_popularity['Max'] = genre_year_popularity.idxmax(axis=1)
genre_year_popularity['Max']
# Find the correlation between variables
df.corr()
Revenue has a positively correlation with vote_count, budget, and popularity
Let's plot each variable to see how data looks like
df.plot(x='revenue', y='vote_count', kind='scatter');
df.plot(x='revenue', y='budget', kind='scatter');
The popularity of genre listed by year
# Plot the relation between genre and popularity for each year
fig, ax = plt.subplots(10, 6)
fig.set_figheight(35)
fig.set_figwidth(30)
for col in range(0, len(genre_year_popularity.columns) - 1):
genre_year_popularity.plot(y=genre_year_popularity.columns[col], kind="bar", ax=ax[int(col / 6), col % 6])
df.groupby(by='release_year').sum()['popularity'].plot(title='Release year relation with popularity');
The above plot show that the popularity is increasing from year to year. Thus, we can't use it for comparing popularity between years. A better way for comparing genre's popularity is to use count of each genre.
# Plot the relation between genre and movies count for each year
fig, ax = plt.subplots(10, 6)
fig.set_figheight(35)
fig.set_figwidth(30)
for col in range(0, len(genre_year.columns) - 1):
genre_year.plot(y=genre_year.columns[col], kind="bar", ax=ax[int(col / 6), col % 6])
For the last 12 years the most frequent genre is Drama followed by Comedy
plt.bar(genre_year.index.tolist(), genre_year_popularity[2015])
plt.bar(genre_year.index.tolist(), genre_year[2015], alpha=0.5) # Use alpha to make the front bar a little transparent to see what is behind
plt.title("Popularity vs Count for genre in 2015")
plt.xlabel("Genre")
plt.ylabel("Popularity/Count");
We can see that in 2015 the Drama is the most popular genre and also has the most number of movies produced. On the other side, we can see that action movies is the second popular genre but it is positioned the forth with respect to production count. So that we can see that the genre popularity is not always the motivation for producers.
plt.scatter(df[df.release_year==2015]['revenue'], df[df.release_year==2015]['budget'])
plt.title("Budget relation with revenue")
plt.xlabel("Revenue")
plt.ylabel("Budget");
Popularity is weakly corelated to revenue as it depends more on votes. This could show that popularity and votes are not the only source of revenu.
plt.scatter(df[df.release_year==2015]['revenue'], df[df.release_year==2015]['vote_count'])
plt.scatter(df[df.release_year==2015]['revenue'], df[df.release_year==2015]['popularity'])
plt.title("Revenue relation with popularity/votes")
plt.xlabel("Revenue");
It is obvious that the budget, vote_count, and popularity directly propotional to revenue in most cases. Let's take more insights by looking at the maximus revenue movie
df[df.revenue==df.revenue.max()]
| Value | Max value | |
|---|---|---|
| budget | 237000000 | 4.250000e+08 |
| vote_count | 8458 | 9767 |
| popularity | 9.432768 | 32.985763 |
The table above compare the values for maximum revenue movie with the absolutely maximum values of the three dependent variables we are analysing. The result confirm the correlation coefficient by example as vote_count is the strongest correlated to revenue and the nearest value to maximum.
The data not containing some data that could affect the analysis. An example of these data is